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Abstract 

We consider the problem of approximate 
Bayesian inference in log-supermodular models. 
These models encompass regular pairwise MRFs 
with binary variables, but allow to capture high- 
order interactions, which are intractable for ex¬ 
isting approximate inference techniques such as 
belief propagation, mean field, and variants. We 
show that a recently proposed variational ap¬ 
proach to inference in log-supermodular models 
-L-Field- reduces to the widely-studied min¬ 
imum norm problem for submodular minimiza¬ 
tion. This insight allows to leverage power¬ 
ful existing tools, and hence to solve the vari¬ 
ational problem orders of magnitude more effi¬ 
ciently than previously possible. We then pro¬ 
vide another natural interpretation of F-Field, 
demonstrating that it exactly minimizes a spe¬ 
cific type of Renyi divergence measure. This in¬ 
sight sheds light on the nature of the variational 
approximations produced by F-FlELD. Further¬ 
more, we show how to perform parallel inference 
as message passing in a suitable factor graph at a 
linear convergence rate, without having to sum 
up over all the configurations of the factor. Fi¬ 
nally, we apply our approach to a challenging 
image segmentation task. Our experiments con¬ 
firm scalability of our approach, high quality of 
the marginals, and the benefit of incorporating 
higher-order potentials. 


1. Introduction 

Performing inference in probabilistic models is one of the 
central challenges in machine learning, providing a foun¬ 
dation for making decisions with uncertain data. Unfor¬ 
tunately, the general problem is intractable and one must 
resort to approximate inference techniques. The impor¬ 
tance of this problem is witnessed by the amount of interest 
it has attracted in the research community, which has re¬ 
sulted in a large family of approximations, most notably the 


mean-field (Wainwright & Jordan, 2008) and belief propa¬ 
gation (Pearl, 1986) algorithms and their variants. One ma¬ 
jor drawback of these and many other techniques is the ex¬ 
ponential dependence on the size of the largest factor which 
restricts the class of models one can use. In addition, these 
methods generally involve non-convex objectives, resulting 
in local optima (or even non-convergence). 

We consider the problem of inference in distributions over 
sets, also known as point processes. Formally, we have 
some finite ground set V and a measure P that assigns 
some probability P{A) to every subset A C V. We would 
like to point out that we can equivalently see such distribu¬ 
tions as being defined over |U| Bernoulli random variables 
Xi G {0,1}, one for every element in the ground set i G U 
indicating if element i has been selected. As a concrete ex¬ 
ample showing this equivalence consider the task of image 
segmentation in computer vision, where one wants to sepa¬ 
rate the foreground from the background pixels. Tradition¬ 
ally, one defines one random variable Xp G {0,1} for each 
pixel p indicating if the pixel is in the foreground or the 
background. We can also isomorphically treat the distribu¬ 
tion as being defined over subsets of the set of all pixels V. 
In this case, for any subset A C U the quantity P(A) is the 
probability of pixels A belonging to the foreground. In the 
remaining of the paper we will employ this latter view of 
distributions over sets. The additional assumption that we 
make is that the distribution is log-supermodular, i.e. can 
be written as P{A) = ^ exp(—F(A)), where F is some 
submodular function. 

Related work. Submodular functions are a family of set 
functions exhibiting a natural diminishing returns prop¬ 
erty, originating first in the field of combinatorial opti¬ 
mization (Edmonds, 1970). They have been applied to 
many problems in machine learning, including cluster¬ 
ing (Narasimhan et al., 2005), variable selection (Krause 
& Guestrin, 2005), structured norms (Bach, 2010), dictio¬ 
nary learning (Cevher & Krause, 2011), etc. Submodular 
functions have huge implications for the tractability of (ap¬ 
proximate) optimization, akin to convexity and concavity 
in continuous domains. While the major emphasis has con¬ 
sequently been on optimization, submodular functions can 
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be also employed to define probabilistic models. Special 
cases include Ising models used in computer models and 
the determinantal point process (DPP) (Kulesza & Taskar, 
2012) used for modeling diversity. Alas, submodularity 
does not render the inference problem tractable, which re¬ 
mains extremely difficult even for the Ising model (Gold¬ 
berg & Jerrum, 2007; Jerrum & Sinclair, 1993) which has 
only pairwise interactions. 

The study of approximate Bayesian inference in general 
log-supermodular models has been recently initiated by 
Djolonga & Krause (2014). They provide a general 
variational approach -L- Field- that optimizes bounds on 
the partition function via the differentials of submodular 
functions. While their approach leads to optimization 
problems that can be solved exactly in polynomial time for 
arbitrary high order interactions, presently the approach is 
slow, and impractical for large scale inference tasks such 
as those arising in computer vision. 

Our contributions. We improve over their result in sev¬ 
eral ways. First, by showing an equivalence of L-Field 
with a classical problem in submodular minimization - the 
minimum norm point problem - we obtain access to a large 
family of specially crafted algorithms that can handle mod¬ 
els with very large numbers of variables. In the experimen¬ 
tal section we indeed perform inference over images, which 
have hundreds of thousands of variables. This insight also 
implies, for example, that the approximation agrees on 
the mode of the distribution, hence the MAP problem is 
solved for free. Secondly, by establishing another impor¬ 
tant connection, namely to a specific type of information 
divergence, we shed light on the type of approximations 
that result from this method. Thirdly, we show how spe¬ 
cial structure of many real-world log-supermodular models 
(such as those in image segmentation with high-order po¬ 
tentials) enable a highly efficient parallel message passing 
algorithm that converges to the global optimum at a lin¬ 
ear rate. Lastly, we perform extensive experiments on a 
challenging image segmentation task, demonstrating that 
our approach is scalable, provides more accurate marginals 
than existing techniques, and provides evidence on the ef¬ 
fectiveness of models using high-order interactions. 

2. Background: Submodularity and 
log-supermodular models 

Formally, a function F : 2^ ^ M is said to be submodular 
if for any pair of sets A C B and x ^ B it holds that 

F{{x} U A) - F{A) > F{{x} UB)- F{B). 

In other words, the gain of adding any element x de¬ 
creases as the context grows, which is the diminishing re¬ 
turns property already mentioned. Additionally, without 



Figure 1: Generated superpixels to be used as attractive 
higher order potentials for encouraging label consistency. 

any loss of generality we assume that F is normalized so 
that F(0) = 0. We will consider Gibbs distributions that 
arise from these models, more specifically probability mea¬ 
sures of the form 

P(5) = |exp(-F(5)), 

for some submodular F : 2^ ^ M. These models are 
called log-supermodular or attractive for reasons explained 
below. 

Examples. A typical example of such models is the reg¬ 
ular Ising model, which can be used for the image segmen¬ 
tation task from the introduction. Continuing with that ex¬ 
ample, we define a set of edges F that connect neighboring 
pixels, and for every pair of neighbors we specify 

a weight > 0 that quantifies their similarity. To 

model the preference of neighbors to be assigned to the 
same segment, we use the cut function 

VA C V. F(^A}j ^ ^ l|An{p,p'}|=i'^{p,p'}• 

{pF}^e 

Hence, if we place two neighboring pixels p and p' in dif¬ 
ferent segments, we will cut the edge {p,p'} and be “pe¬ 
nalized” by the corresponding weight, which explains the 
attractive behavior of the model. We can go one step further 
and define regions Pi FV which we would prefer to be in 
the same segment. One strategy to generate the regions, 
used by Kohli et al. (2009) is to generate superpixels, as 
illustrated on Figure 1 . We can then modify the model to 
incorporate these higher order potentials by adding terms 
of the form (/)(|Fi fl A|/|F^|) for some concave function 
As a concrete example, consider = z{l — z), which 
assigns a value of 0 when the pixels in the superpixel are in 
the same segment, and assigns a larger penalty otherwise, 
which is maximal when the pixels are equally split between 
the two segments. 

Modular functions. The simplest family of submodular 
functions are modular functions, which can be seen as the 
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discrete analogue of linear functions. Namely, a function 
s : 2^ ^ M is said to be modular if for all A C V it 
holds that s{A) = family of distribu¬ 

tions that arise from these functions are exactly the family 
of completely factorized distributions ^ because 

P{S) oc exp(-s(S')) = P[exp(-Si). 

ies 

As evident from their definition, modular functions are 
uniquely defined through the quantities s({i}) for all i G 
V. It is very useful to view modular functions as vectors 
s G with coordinates Si = 


One can extract the solution to the submodular minimiza¬ 
tion problem from the solution to the above problem by 
thresholding, which is formalized in the following theorem. 

Theorem 1 (Fujishige (2005)). ^s* is the optimal solution 
to problem (3), define the following sets 

A_ = {v \v and 5* < 0}, and 

Aq = {v \v and 5* < 0}. 

Then A_ and Aq are the unique minimal and maximal min- 
imizers of F. 

3. Variational inference with L-Field 


Submodular polyhedra. There are several polyhedra 
that contain some of these modular functions (in their vec¬ 
torial representation) that we will make use of. More 
specifically, we are interested in the submodular polyhe¬ 
dron P{F) and the base polytope B{F), which are defined 
as 

P(F) = {s G I VA C 12: s{A) < F(A)}, (1) 

B{F) = P{F) n {s G I s{V) = F(V)}. (2) 

In other words, P{F) is the set of all modular lower bounds 
of the function F, while B{F) adds the further restriction 
that the bound must be tight at the ground set V. It can be 
shown that these polyhedra are not empty and their geom¬ 
etry is also well understood (Fujishige, 2005; Bach, 2013). 
Moreover, what is especially surprising, is that even though 
B{F) is defined with exponentially many inequalities, we 
can optimize linear functions over it in 0( | V| log | V|) time 
with a simple greedy strategy (Edmonds, 1970). 


The main barrier to performing inference in log- 
supermodular models is the computation of the normaliz¬ 
ing factor Z, also known as the partition function in the 
statistical physics literature. We cannot compute it directly 
as we have to sum up over all S' C V, so we have to use 
approximative techniques. One common approach is to de¬ 
fine an optimization problem over some variational param¬ 
eter q, so that we can compute the quantity of interest by 
optimizing this problem. 

We now review the variational approximation technique re¬ 
cently introduced by Djolonga & Krause (2014). Their 
method relies on two main observations: (i) modular func¬ 
tions have analytical log-partition functions and (ii) sub¬ 
modular functions can be lower-bounded by modular func¬ 
tions. The main idea is the following: if it holds that 
VA C V : s{A) < F{A), then it will certainly be the case 
that 

log < log Z = ^log{l + e“"0. 

ACV ACV iev 


MAP estimation and the minimum norm point. A very 
natural question that arises for any probabilistic model 
is that of finding a MAP configuration, which for log- 
supermodular distribution amounts to minimizing the func¬ 
tion F. This is a problem that has been studied in much de¬ 
tail and has resulted in numerous approaches. The fastest 
known combinatorial algorithm due to Orlin (2009) has a 
bound of 0{n^ + rn^), where r is the cost of evaluating 
the function, and can be prohibitively expensive to run for 
larger ground sets. An algorithm that performs better in 
practice, but only has a pseudopolynomial running time 
guarantee (Chakrabarty et al., 2014), is the Fujishige-Wolfe 
algorithm (Fujishige, 1980). This method approaches the 
problem by solving the following convex program, known 
as the minimum norm problem. 

minimize llsIp. (3) 

sGB(F) 

^Because we use Gibbs distributions, note that they can not 
assign zero probabilities. 


We thus have a family of variational upper bounds on the 
partition function parametrized by the modular functions s, 
over which we can optimize to minimize the right hand side 
of the inequality. As shown by Djolonga & Krause (2014) 
this variational problem can be reduced to the following 
convex separable optimization problem over the base poly¬ 
tope 


minimize > log (1 + exp (—5A). 

seBiF) ^ 


(4) 


This problem - L-Field - can be then solved using the 
divide-and-conquer algorithm (Bach, 2013; Jegelka et al., 
2013) by solving at most 0(min{|I/|, log ^}) MAP prob¬ 
lems, where e is the tolerated error on the marginals. It can 
be also approximately solved using the Frank-Wolfe algo¬ 
rithm at a convergence rate of 0{l/k). While these results 
establish tractability of the variational approach, in general 
solving even one MAP problem requires submodular min¬ 
imization - an expensive task, and repeated solution may 
be too costly. Convergence of the Frank-Wolfe method is 
empirically slow. 
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4. L-Field = Minimum norm point. 

Our first main contribution is the following, perhaps sur¬ 
prising, result: 

Theorem 2. Problems (4) and (3) have the same solution. 

The proof of this theorem (given in the appendix) crucially 
depends on the peculiar characteristics of the base poly¬ 
tope. Similar results have been shown (for other objectives) 
by Nagano & Aihara (2012). This theorem has three im¬ 
mediate, extremely important consequences. First, since 
the minimum-norm point approach is often the method 
of choice for submodular minimization anyway, this in¬ 
sight reduces the cost from solving many MAP problems 
to a single minimum norm point problem, which leads to 
substantial performance gains - a factor of 0{\V\) if we 
seek the optimal variational solution! Secondly, given this 
equivalence and Theorem 1, we can immediately see that 
we can in fact extract the MAP solution by thresholding 
the marginals at 1/2. 

Corollary 1. We can extract the unique minimal and max¬ 
imal MAP solutions by thresholding the optimal marginal 
vector atXj^. 

Thus, the L-Field approach results in the exact MAP so¬ 
lution in addition to approximate marginals and an upper 
bound on the partition function. Thirdly, since the min¬ 
imum norm point problem is well studied, faster algo¬ 
rithms for important special cases become available. In 
particular, in §6, we demonstrate how certain types of log- 
supermodular distributions enable extremely efficient par¬ 
allel inference. 

5. The divergence minimization perspective 

The L-Field approach attacks the partition function 
directly. One can of course employ the factorized dis¬ 
tribution parametrized by the minimizer s* of the upper 
bound to obtain approximate marginals. However, it is not 
immediately clear what properties the resulting distribu¬ 
tion has, apart from agreeing on the mode (as shown by 
Corollary 1). To this end, we turn to the theory of diver¬ 
gence measures as that will enable us to understand the 
solutions preferred by the method. Divergence measures 
are functions D{P\\Q) of two probability distributions P 
and Q that quantify the degree of dissimilarity between the 
arguments. Once we have picked a divergence measure D, 
we are interesting in minimizing D{P \ \Q) among some 
set of approximative distributions Q G Q. The family 
which is of particular interest to us is that of completely 
factorized distributions that assign positive probabilities, 
which we now formally define. 


Definition 1. We define the set Q of completely factorized 
positive distributions as 

Q = {Q \ Q{S) (X ]^exp(-g^) for some Cl G MX}. 
ies 

There are many choices for a divergence measure, the most 
prominent examples being the KL-divergence and the fam¬ 
ily of Renyi divergences (Renyi, 1961). Of particular inter¬ 
est for our analysis is the special infinite order of the Renyi 
divergence, defined as follows: 

Definition 2 (Van Erven & Harremoes (2012)). Define the 
Renyi divergence of infinite order between P{S) and Q{S) 

Z)oo(P||Q)=logsup (5) 

scv Q[o) 

In the terminology of Minka et al. (2005) we can see that 
the Doq divergence is inclusive, which means that it would 
try to “cover” as much as possible from the distribution: 
The variational approximation is conservative in the sense 
that it attempts to spread mass over all sets that achieve sub¬ 
stantial mass under the true distribution. As we now show, 
it turns out that when we minimize this divergence for log- 
supermodular distributions we can focus our attention only 
on some specific factorized distributions. 

Lemma 1. When P is log-supermodular, to solve 
minimizeg^Q Doo{P || Q) have to only optimize over 
modular functions q that are global lower bounds of F. 

What this lemma essentially says, is that a minimizing dis¬ 
tribution q* can be always found in P{F). This result also 
implies the central result of this section, that the variational 
approach we have considered essentially minimizes the in¬ 
finite divergence. 

Theorem 3. When P is log-supermodular, the problem 
minimizeg^Q Doo{P || Q) is equivalent to problem (4). 

This theorem has the following immediate consequence: 
Corollary 2. For log-supermodular models, problem 
minimizeg^Q II Q) is polynomial-time tractable 

viaO{\V\) MAP (submodular minimization) problems. 

Hence, any log-supermodular distribution has the prop¬ 
erty that we can find the closest factorized distribution to 
it w.r.t. this specific divergence in polynomial time, irre¬ 
spective of whether the distribution factorizes into smaller 
factors or not. We would like to point out that the above 
criterion does not necessarily hold in general for non-log- 
supermodular distributions, which we formally show. 
Lemma 2. Lemma 1 does not hold for general point pro¬ 
cesses. Specifically, there exists a log-submodular counter 
example. 

The proofs of all claims are provided in the supplemental 
material. 
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6. Parallel inference for decomposable models 

Very often the submodular function F has structure that one 
can exploit to procure faster inference algorithms. In par¬ 
ticular, the function often decomposes, i.e., can be written 
as a sum of (simpler) functions as 

R 

F{S) = Y,Fi{Sr\Vi), 

i—\ 

where Fi : 2^" ^ M are submodular functions with ground 
sets Vi. This setting has been considered, e.g., by Stobbe 
& Krause (2010) and Jegelka et al. (2013). The decompo¬ 
sition implies that the corresponding distribution factorizes 
as follows 

R 

P{S) oc P[exp(-Fi(S'nyi)). (6) 

i=l 


to minimizing the divergence Doo{P || Q), which we now 
briefly describe. The main idea is to approximate each fac¬ 
tor exp(—(Fn Vi)) with a completely factorized distribu¬ 
tion Qi{S) oc exp{—qi{S)), such that the product Y[f=i Qi 
is a good approximation to the true distribution in terms of 
the given divergence. Then, we optimize iteratively using 
the following procedure. 

1. Pick a factor Fi. 

2. Replace the other factors Fj for j ^ i with their ap¬ 
proximations Qj and minimize 

1 ^ 

i?oo(^exp(-Fi(5))nQiiinQi) 

* i=l 

over the factorized approximation Qi . 


In fact, the examples we discussed in §2 both have this 
form, factorizing either into pairwise potentials or into the 
potentials defined by the superpixels. Such models can be 
naturally represented via a factor graph G that has as nodes 
the union of the ground sets Vi and the factors Fi,..., F^^. 
We then add edges F in a bipartite way by connecting each 
factor Fi to the elements Vi that participate in it (e.g. Fi is 
connected to iff G Vi). For any node w in the graph (ei¬ 
ther a factor, or variable node), we will denote its neighbors 
by 6{w). 

When the function enjoys such a decomposition, the base 
polytope can be written as the Minkowski sum of the base 
polytopes of the summands, or formally ^ 

B{F) = J2B{Fi). 


Hence, the minimum norm problem (3) that we are inter¬ 
ested in can be rewritten as the following problem. 


minimize 

ciieB(Fi) 


vEV Fi£S{v) 


In the following, we discuss two natural message passing 
algorithms exploiting this structure. 


In other words, we choose a factor and minimize the infi¬ 
nite divergence for that factor, but instead of using the true 
factors exp(—Fj (S)) for j ^ i, wo replace them with their 
current modular approximations Qj. 

A parallel approach. One downside of the approach pre¬ 
sented above is that it has to be applied sequentially, i.e., 
one factor has to be updated at a time to ensure con¬ 
vergence. An alternative is to apply an approach used 
by Jegelka et al. (2013), which allows to perform message 
passing in parallel without losing the convergence guaran¬ 
tees. Jegelka et al. (2013) assume that all factors depend on 
all variables (i.e. Vi = V). In the following, we generalize 
their setting in order to allow V 7^ V. By changing the dual 
problem they consider (shown in detail in the appendix) we 
arrive at a form that is more natural to our setting and can be 
seen as performing message passing in the factor graph. To 
describe the messages, we have to define the following pair 
of norms that arise from the structure of the factor graph. 

Definition 3. For any G where S C V, we define 
the following pair of norms. 

iixsIIg = Y = Y 

ves ves 


Expectation propagation. A very natural approach 
would be to perform block coordinate descent one factor 
at a time. If we look through the lens of divergence mea¬ 
sures, as introduced in §5, we can make a clear connec¬ 
tion to (a variant of) expectation propagation^, the mes¬ 
sage passing approach of Minka et al. (2005) specialized 

^Ifv ^ Vi, then the elements from B{Fi) are treated as having 
a zero for that coordinate. 

^Typically, expectation propagation is defined w.r.t. the KL- 
divergence. 


The messages from variables to factors are simple sums, 
similar to those in standard belief propagation 

The factors always keep some vector on their base polyhe¬ 
dron, which at iteration t will be denoted by G B{Fi). 
Then, based on the incoming messages, they update this 
vector by solving a convex problem, which is much cheaper 
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than the exhaustive computation one has to do for belief 
propagation (which is exponential in the factor size). We 
will denote the message sent from node u to node w at 
iteration t by ^ the vector of mes¬ 

sages received at iteration t at node Fi (one message from 
each V G Vi), then the factor solves a projection problem 
parametrized by (m^,x^), whose solution is assigned to 
Written formally, we have 

q*+i = argmin Hq^ - (q‘ - m*)||^,. 

qi G-B(Fi) 

As this is a convex separable problem on the base polytope, 
it can be solved for example using the divide-and-conquer 
algorithm (Bach, 2013). Having solved this problem, the 
factor sends the following messages to its neighbours 


Stated differently, it will send to every variable node v 
the coordinate of the stored vector corresponding to that 
variable. At every iteration t we can extract the current 
factorized approximation to the full distribution by simply 
considering the incoming messages at the variable nodes. 
Specifically, the approximation at time step has in the 
v-th coordinate the sum of incoming messages at the node 
V, or formally 

FiediFj) 

Because the algorithm can be seen as performing block 
coordinate descent on a specific problem (discussed in the 
appendix), the message passing algorithm described above 
possesses strong convergence guarantees that depend on 
the structure of the factor graph. These guarantees even 
hold if all messages from nodes to factors, and all messages 
from factors to nodes are each computed in parallel. An 
important quantity that appears in the convergence rate is 
the maximal variable connectivity Ay = max^^y |^(^)|. 
Based on recent new results by Nishihara et al. (2014) 
on block coordinate descent for a similar dual (assuming 
that all factors depend on all variables, as considered 
by Jegelka et al. (2013)), we extend their analysis to obtain 
a linear convergence rate for our message passing scheme. 

Theorem 4 (Extension of Nishihara et al. (2014)). If the 
graph is Ay-regular, s.t. every variable appears in exactly 
Ay factors, then the message passing algorithm converges 
linearly with rate (1 — py^:^) • More specifically 

||q‘ - q*|| < 2||q° - q^lUv^^ ” 

where q* is the optimal point, q^ is the initial point and E 
is the number of edges in the factor graph. 


7. Experiments 

We now report experimental results on applying our par¬ 
allel variational inference scheme to a challenging image 
segmentation problem as motivated in §2. The goal of our 
experiments is to test the scalability of our approach to 
large problems, and to evaluate the quality of the marginals 
both qualitatively and quantitatively. We used the data 
from Jegelka & Bilmes (201 1), which contains a total of 36 
images, each with a highly detailed (pixel-level precision) 
ground truth segmentation. Due to intractability, we can¬ 
not compute the exact marginals against which we would 
ideally wish to compare. As a proxy for measuring the 
quality of the approximations, we use the area under the 
ROC curve (AUC) as compared to the ground truth seg¬ 
mentation. We classify each pixel independently as fore- or 
background by comparing its approximate marginal against 
a threshold, which we vary to obtain the ROC curve. We 
have used the following model, which contains both pair¬ 
wise and higher-order interactions. 

F{A) = am{A) + (3Foui{A) +'y ^2 ^ ’ 

where 

• the unary potentials m(') were learned from labeled 
data using a 5 component GMM; 

• the pairwise potentials Fcut connect neighboring pix¬ 
els X and x' with weights w(x,x^) = exp(— 6 >||x — 
XI p), where x and x' are the RGB values of the pixels; 

• the higher order potentials were generated using the 
mean-shift algorithm of Comaniciu & Meer (2002). 
We have used two overlapping layers of superpixels, 
each layer with different granularity. The concave 
function was defined as (/)(z) = z(l — z). 

We compared the following inference techniques. The 
reported typical running times are for an image of size 
427x640 pixels on a quad core machine and we report the 
wall clock time of the inference code (without setting up 
the factor graph or generating the superpixels). 

• Unary potentials only with independent predictions, 
i.e., /3 = 7 = 0 . 

• Belief propagation (BP), mean-field (ME) and frac¬ 
tional belief propagation (EBP) for the pairwise 
model (i.e. 7 = 0). We have used the implementation 
from libDAI (Mooij, 2010). The maximum number 
of iterations was set to 30. We note that this code is 
not parallelized. When we observe fast convergence, 
for example BP can converge in 3 iterations, it takes 
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(a) Original image. (b) FBP 1. (c) FBP 2. (d) FBP 3. (e) FBP 4. (f) HOP 1. 



(g) Ground truth. (h) DR 1. (i) DR 2. (j) DR 3. (k) DR 4. (1) HOP 2. 


Figure 2: Example marginals from the different approximation procedures for the original image (a) with ground truth 
segmentation (g). For the results comparing FBP and DR (b-e,h-k) we have used the same pairwise weights and weights. 
The strength of the prior used monotonically decreases with the number specified after the respective method (i.e., b,f,h 
correspond to strong and e,k,l, to weak priors). Note how FBP is overconfident, whereas our method offers marginals with 
much higher dynamic range. 


about 45 seconds. Even though we have set a rela¬ 
tively low number of iterations, the running times can 
be extremely slow if the methods do not converge. For 
example, running mean-field for 30 iterations can take 
more than 3 minutes. 

• Our approach using only pairwise potentials (7 = 0), 
solved using the total variation Douglas-Rachford 
(DR) code from (Barbero & Sra, 2011; 2014; Jegelka 
et al., 2013). We ran for at most 100 iterations. The 
inference takes typically less than a second. 

• Our approach with higher order potentials (HOP) only 
(/5 = 0). The inference takes less than 12 seconds. 

For every method we tested several variants using different 
combinations for a, /3 ,7 and 0 (exact numbers provided in 
the appendix). Then, we performed a leave-one-out cross- 
validation for estimating the average AUC. We have also 
generated a sequence of 10 trimaps by growing the bound¬ 
ary around the true foreground to estimate accuracy over 
the hardest pixels, namely those at the boundary. 

Accuracy. We first wish to quantitatively compare the ac¬ 
curacy of the approximate marginals. We report the aggre¬ 
gate results in Figure 3, and the ROC curves in Figure 4. 


Method 

Avg. AUC 

Std. Dev. 

Avg. AUCT 

Std. Dev. 

HOP 

0.9638 

0.0596 

0.9567 

0.0642 

DR 

0.9602 

0.0617 

0.9496 

0.0691 

FBP 

0.9500 

0.0662 

0.9357 

0.0975 

MF 

0.9486 

0.0675 

0.9420 

0.0764 

UNARY 

0.9484 

0.0681 

0.9425 

0.0759 

BP 

0.9445 

0.0779 

0.9360 

0.0964 


Figure 3: Average scores of the methods estimated using 
leave-one-out cross validation. The Avg. AUC column is 
the average area under the ROC curve. The Avg. AUCT 
column reports the average of the mean AUC over the 10 
trimaps. The second and the fourth columns are the stan¬ 
dard deviation of the preceding columns. 

We can clearly see that our approach outperforms the tra¬ 
ditional inference methods for both objectives — the AUC 
over the whole image and over the challenging boundary 
(trimaps). Sometimes we see very poor behavior of the al¬ 
ternative methods, which can be attributed to either their 
over-confidence (as verified below), or the fact that they 
optimize non-convex objectives and can fail to converge 
within the given number of iterations. Fastly, capturing 
high-order interactions leads to higher accuracy (in partic¬ 
ular around the boundary) than pairwise potentials only. 





















Scalable Variational Inference in Log-supermodular Models 



(a) Average ROC curves for the different methods. 


(b) Average AUC for trimaps of increasing size. 


Figure 4: Comparison of inference methods in terms of their accuracy. For each method we optimize the parameters via 
grid-search, and report leave-one-out cross-validation results, (a) ROC curves for classifying pixels as fore- or background 
by independently thresholding marginals, averaged over the whole image, (b) Results over the trimaps (blurred boundaries 
around the ground truth segmentation), focusing on “difficult” pixels. For every algorithm and every of the 10 trimap sizes 
we report the average area under the curve. 


Properties of marginals. We would also like to un¬ 
derstand the qualitative characteristics of the resulting 
marginals of our methods when compared with the tra¬ 
ditional techniques. From the discussion on the diver¬ 
gence minimization in §5, we would expect the approx¬ 
imate marginals to avoid assigning low probabilities and 
rather prefer to err conservatively, i.e., on the side of caus¬ 
ing false positives. On the other hand, it is known that 
the results of belief propagation are often over-confident. 
For this purpose, we provide a visual comparison in Fig¬ 
ure 2. Namely, each of the four FBP/DR pairs are results 
using the respective algorithms for the same parameters of 
the model. We observe exactly what the theory predicts — 
the distribution obtained via L-Field is less concentrated 
around the object and mass is spread around more. The 
contrast is starkest on Figures 2 (b) and (h), where we use 
a very strong pairwise prior (high /S). On Figures 2(e) and 
(k) we have used a very weak pairwise prior (low /3), and as 
expected the resulting marginals are mainly determined by 
the unary part and the choice of inference procedure does 
not make a difference. The results in the last column are 
from the higher order model, with two different values of 7 
(the strength of the higher order potential). We can see that 
the resulting probabilities better preserve the boundaries of 
the object and the fine details, which is one of the main 
benefits of using these models. 


8. Conclusion 

We have addressed the problem of variational inference in 
log-supermodular distributions. In particular, building on 
the L-Field approach of Djolonga & Krause (2014), we 
established two natural, important interpretations of their 
method. First, we showed how L-Field can be reduced 
to solving the well-studied minimum norm point problem, 
making a wealth of tools from submodular optimization 
suddenly available for approximate Bayesian inference. 
Secondly, we showed that the factorized distributions 
returned by L-Field minimize a particular type of infor¬ 
mation divergence. Both of these theoretical connections 
are immediately algorithmically useful. In particular, 
for the common case of decomposable models, both 
connections lead to efficient message passing algorithms. 
Exploiting the minimum norm connection, we proved 
strong convergence rates for a natural parallel approach, 
with convergence rates dependent on the factor graph struc¬ 
ture. Lastly, we demonstrate our approach on a challenging 
image segmentation task. Our results demonstrate the 
accuracy of our marginals (in terms of AUC score) com¬ 
pared to those produced by classical techniques like belief 
propagation, mean field and variants, on models where 
these can be applied. We also show that performance can 
be further improved by moving to high-order potentials, 
leading to models where classical marginal inference tech¬ 
niques become intractable. We believe our results provide 
an important step towards practical, efficient inference in 
models with complex, high-order variable interactions. 
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A. Theory 

A.l. Equivalence between the minimum norm and inference problems 

We will show a stronger result from which Theorem 2 follows by taking wi = 1 and = 0. 

Lemma 3. For positive weights Wi > 0 the objectives ~ ViY Siev log(exp(—+ exp(— 

have the same optimum under the constraint x G B{F). 

Proof. For the second objective we have that 

^ log{ex.p{-WiXi) + ex.p{-Wiyi)) ^ log{exp{wiZi) + ex.p{-Wiyi)). 

iev * iev * 

Hence, the optimum of the problem is the negative of the problem on —B{F), which is the base polytope of the submodular 
function F{A) = F{V — A) — F{V). The gradient of this objective with respect to any Zi is equal to a{wi{zi + yi)), 
where a{u) = 1/{1 F e~^) is the sigmoid function. As the weights Wi are positive, we have that 


a{wi{zi + yi)) < cr{wj{zj + yj)) Wi{zi + yi) < Wj{zj + yj). 

From Fujishige (2005)[Theorem 8.1], it follows that the above problem and Wi{zi yi)‘^ have the same solution z* 
on—5(F) = B{F). However, if z* is the projection of — y onto —B{F), then x = — z* is the projection of y onto 
5(F). □ 

A.2. Connection to the infinite Renyi divergences 

If we expand the Renyi infinite divergence for 5(5) = ^ exp(—F(5)) and Q{S) = ^ exp(—g(5)) for modular q{') wq 
get the following 

Doo{P II Q) = log sup = log sup + sup{g(A) - F(A)}. (7) 

ACV Q[2i) ACV — zSq ACV 


As we will consider the problem of minimizing the above quantity with respect to the modular function q we can ignore 
the constant log Zp. We will also expand the log-partition function of q, log Zq = '^i^y log(l + exp(—g^)) and introduce 
a new variable t capturing the supremum above to arrive at the following formulation. 

minimize log(l + exp(—g^)) + t 

iev ( 8 ) 

subject to q{A) < F{A) + t for ail A CV 

Definition 4. For any normalized submodular function F, define C{F) to be the optimum value of minimizing 
^i^y log(l + exp(—5i)) subject to s e B{F). 

To show the connection, we will need the following two lemmas. 

Lemma 4 (Djolonga & Krause (2014)). By strong Fenchel duality we have that 

C*{F) = min Vlog(l + exp(-Si)) = sup H[p] - /(p), 

seB(F)^^ pe[o.i]v 

where f is the Lovdsz extension of F andWjp] is the entropy of a random vector of independent Bernoulli random variables. 


The following lemma is already known, as such functions have been already used, but we prove it for completeness. 
Lemma 5. For any normalized submodular function F : 2^ ^ M define F^ : 2^ ^ M as follows. 


ms) = 


0 ifS = 0 

F(5)+^ 


Then, Fj^ is submodular, normalized and has Lovdsz extension //3(w) = /(w) + p max^ Wi. 
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Proof. To show that F is submodular we need to show that (see e.g. (Bach, 2013)[Prop. 2.3]) for any A C V and any j, k G 
V — Awe have that Fp{A U {k}) — Fp{A) > Fp{AL} {j, k}) — Fp{AU {j}). If A then the above inequality follows 
immediately from the submodularity of F. Otherwise, we have the inequality F{{k}) + /3-F(0) >F({i,fc})-F({i}), 
which again easily follows from the submodularity of F and the fact that P >0. The Lovasz extension of F^ is defined as 

k=l 


where the indices are chosen so that Wj^ > Wj^ > ... > Then, every term in the sum will be the same if we 

replace with F (because /3 will be added and subtracted), except for the first term when k = 1. This term is equal to 
F ^ — F{9)), and the result follows immediately as Wj^ = max^ Wi. □ 

Lemma 6. For submodular functions Problem ( 8 ) reaches the minimum for t = 0. 

Proof Define OPT /3 to be the optimum value of problem ( 8 ) for t = p. Note that OPT /3 = C*{Fj 3 ) + p. Then 

OPTo H[p] - /(p) < H[p] - /(p) + ^ (1 - maxpi) H[p] - /^^(p) + /3 < £*(//?) + /3 = OPT;?. 

I 

^ ^ ^ 

>0 


□ 

And this immediately implies Lemma 1 and Theorem 3. We will now prove Lemma 2 by giving a specific counterexample. 
Lemma 7. There is a supermodular function for which the minimum is achieved for some t > 0. 

Proof For t = 0, the Lagrange dual problem of Problem ( 8 ) is 

maximize — 

ACV iev A 3 i 

subject to 0 < ^ Aa < 1 for all i eV, 

A3i 

where h{p) = —plogp — (1 — p)log(l — p) is the binary entropy function (defined so that h{0) = h{l) = 0). The 
Lagrange dual is easily derived if we note that —h{—u) is the covex conjugate of the primal objective and use (Boyd & 
Vandenberghe, 2004)[§5.1.6]. Consider the function 

F( 0 ) = 0 , F({ 1 }) = - 20 , F({ 2 }) = -8, F({ 1 , 2 }) = -16, 

which is supermodular as 


F (1 I { 2 }) = -16 - (- 8 ) = -8 > -20 = F({ 1 }), and 
F (2 I {!}) = -16 - (-20) = 4 > 8 = F({2}). 


For t = 1 consider the primal feasible variable x = (-19,-7), which has an objective value < 27.1. For t = 0 take the 
dual variable A 0 = Ay = 0, Ai = 1 = A 2 = 1, with a value of — F(l)Ai — F(0)Ao + 0 = 28 > 27.1 and thus a strictly 
better value is achieved for t = 1 . □ 


B. Proofs for Section 6 

We will first define a dual formulation of the minimization problem, for which the claimed message passing scheme does 
BCD. We will work with the set of all valid Lagrange multipliers A = {A G : (Vi; G V) X]iG( 5 (v) ^^4 = 

and the product of base polytopes B = ’^i=iB{Fi). For any element y of either of these sets we will denote by the 
coordinate corresponding to variable v ^ Vi of the i-th block if < i < R). Moreover, for any vector x G we will 
denote by xj^ its restriction to the coordinates S FV. 
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Theorem 5 (Following (Jegelka et al., 2013)). The dual problem of 

1 \ 
minimizey:f{x) + = minimize^: y^(/(x|vj + -||x|vJo) 

i=l 


( 10 ) 


is equal to 


maximize ^ 11 yi - | 

AeA,yieB(F,)^ 2 


2 


( 11 ) 


Proof. The proof is based on that of (Jegelka et al., 2013) [Lemma 1]. The considered problem is the following. 

R 

2 II--IZ ■ 2 ' 


imn/(x) + i||x|^ = inin^(/i(xv,) + i||xv;||G) S-t- = x|v, 

R ^ 

= min max V(/j(xvJ + “ >H {^Vi -x|y)). 

x,xv, \i 2 


i=l 


Because we have zero duality gap, we can change the order of optimization. Then, if we optimize for x, we see that the 
Lagrange multipliers have to belong to A, which was defined above. Hence, we have the following problem 

y^min m^ (xl^yi + h|xyj^ - AfxyJ = max^] max min(x^yi + h|xy,||^ - AfxyJ. 

^^ XT/. ^r.cz-R(i7.\ ^ ? AgA ^—^yieB(Fi) Xv^. * / 


max 

AgA xv^ yieB(Fi) 


Consider the inner problem, i.e. 

minimizexv, x^.yi + ^||xy,l|^ - Afxy = minimizexv,. x^^^i “ + \\\^vA%, 

which is exactly the negative of the convex conjugate of 11| • || ^ evaluated at — y^. Because the convex conjugate is equal 
to ^ II • 11^^ (see e.g. (Boyd & Vandenberghe, 2004)[Ex. 3.22]), the above minimum is equal to -il|y^-AdlL> which we 
had to show. 

□ 


We can now easily see that the message passing algorithm is doing BCD for the dual — each node Fi is first projecting to 
A by subtracting the component-wise mean, and then clearly projecting onto B{Fi) under the norm defined in Definition 3. 

We will now show the linear convergence rate. Because we consider the /c-regular case, the primal can be written in the 
following simpler form 

^ 1 

minimize ^/(x|y.) + ^l|x|y||^ 

and the decomposed dual becomes the problem of finding the closest points between A and B 


maximize 

AGA,yi G.B(Fi) 




( 12 ) 


We now use exactly the same argument as in (Nishihara et al., 2014)[§3.3] with some small changes necessary to accom¬ 
modate our different definitions of A and B. Please refer to that paper and references therein for the terminology used in the 
remaining of the proof. We will show that the Friedrich’s angle between any two faces of B and A is at most which 

combined with (Nishihara et al., 2014)[Thm. 2 and Cor. 5] implies the theorem. To make the notation easier to parse, we 
will assume that the elements in B and A are ordered so that first come the | Vi \ elements corresponding to Fi , then the | V 2 1 
elements corresponding to F 2 and so forth. Under this ordering, the vector space A can be written as the nullspace of the 
following matrix 


S' 


Vk\ 


e^y ^ 


, where = [v = v']. 


(13) 
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As noted in (Nishihara et al., 2014), using (Bach, 2013)[Prop. 4.7] we can express the affine hull diSo{Bz) of any face Bz 
as (where for each i G {1 ,..., i?} the sets , Ar^Mr form a partition of Vi) 


R Mr 

affo(H.) = nn {(yi, • • • yr{Ar,i U ... U Ar,m) = 0}, where y* G B{Fi). 

r=l m=l 


This set can be also written as the nullspace of the following matrix 




T = 






y/\^R,i\ 
I 4 . . 


^/\^R,2\ 




y^\AR,M^\ ) 


To compute the Friedrich’s angle we are interested in the singular values of (Nishihara et al., 2014)[Lemma 6], which 
is equal to 

^rj.T ^ ll,Mi ^RMr ^ 

Hence, we have to analyze the eigenvalues of the square matrix {ST'^), whose rows and columns are indexed by 

I = {(r, m): r e [R],m e [M^]}, and whose ((r^, m^), (r^, m^)) entry is 

1 Fl Arp^^YYij I 

^ '\/\^ri,mi I l^Tj ,mj \ 

We create a graph with vertices I and we add edges between distinct (r^, m^) and {vj^rrij) with weight D Ar-^mj \ 

(zero weight means that we do not add that edge). The normalized graph Laplacian of this graph is equal to 


[^]( 


ri,mi),{rj,mj) 


\Ar 


.nAr 


k-1 




if {ri.rrii) = {rj.rrij) 

otherwise. 


Hence, {ST'^) = I — We want to lower-bound the Cheeger constant h of this graph. Because the spectrum 

of a graph is the union of the spectra of the connected components we will assume that the graph is connected, as we can 
apply the same argument to every component. From the definition of h we have that 


h > 2 ^ where volume = ^ |(5(i;)|(|(5(i;)| — 1) = \V\{k^ — k). 

v£V 


What remains is to bound the minimum cut from below. Because the graph is connected, for any cut U there must exist 
some V that is in sets on both sides of the cut. Let m be the number of sets in U that contain it, and let k — m be the number 
of sets in the complement that contain it. Then, the cut is of size is at least m{k — m) > k — 1. Hence 
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And by Cheeger’s inequality the smallest positive eigenvalue A 2 of the Laplacian C is at least which from the 

relationship [ST^)^{ST^) = / — implies that the squared Friedrich’s angle between A and B is at most 



which completes the proof. 


C. Experiments 

We have used the parameter values in the table below. 

Parameter Values 

0 0 . 1 , 0 . 001 , 0.0001 

a 1 , 0 . 1 , 0 . 01 , 0.001 

p 10 , 1 , 0 . 1 , 0 . 01 , 0.001 

7 10 , 1 , 0 . 1 , 0 . 01 , 0.001 




